Interpretablog - Never Interpret Coefficients

import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns
import scipy.stats as st

from statsmodels.miscmodels.ordinal_model import OrderedModel

import arviz as az
import bambi as bmb
import pymc as pm

Earlier in my career I made a mistake while communicating results of my work.

I needed to fit an ordered logit for a model and needed to communicate the effect of variables which I did via weight plot, where dots represent the estimated coefficient and the lines represent its standard error. However, it was misleading because coefficients of an ordered logit model are in log-odds scale which we are, as humans, not operating under intuitively. My lowest category had very high base rate, so those big coefficients weren’t shifting the probability mass much.

So, what should have been done? Well, I should have pushed predictions out of the model under different values of whatever the variable that was in the interest and show the change in the distribution. Like, computing the predicted values when its at 0 vs when its at 1. Let me show you an example below.

FREQUENTIST WAY

df = pd.read_csv('Trolley_data.csv', index_col=0)
df.head()

	case	response	order	id	age	edu	intention	contact	story	action2
1	cfaqu	4	2	96;434	14	Middle School	0	1	aqu	1
2	cfbur	3	31	96;434	14	Middle School	0	1	bur	1
3	cfrub	4	16	96;434	14	Middle School	0	1	rub	1
4	cibox	3	32	96;434	14	Middle School	1	1	box	1
5	cibur	3	4	96;434	14	Middle School	1	1	bur	1

Are you familiar with Trolley problems? If not, you haven’t been reading much ethics probably.

Trolley problems are ethical dilemmas designed in a specific way. There are various versions of it. Here’s one of them:

There’s a runaway trolley and it’s on a track towards five people who are tied up (they cannot escape). You’re near a lever and if you pull it, the road will diverge and the trolley will hit to a person. Should you pull the lever?

This is a dilemma, since it puts you into a situation where either 1 person or 5 people will lose their lives. In our dataset, we have different version of these scenarios (not necessarily trolleys) where each have different features which are reflected as action, intention, and contact in the dataframe.

Action means the person takes an action (like pulling the lever). In the intention scenario, the person pulls the lever and once the trolley hits to the first person it stops hence does not hit the other five (if you don’t, it will). In the contact version, the person pushes a guy and trolley hits the pushed person and stops. Again, stories doesn’t have to be trolley; you can see the variance in the story column.

Since idea of this post is to focus on visualization, we’ll move on with action, intention, and contact.

ol_m1 = OrderedModel(
    df['response'],
    df[['action', 'intention', 'contact']],
    distr='logit'
)

res_m1 = ol_m1.fit(method='bfgs', disp=False)
res_m1.summary()

OrderedModel Results
Dep. Variable:	response	Log-Likelihood:	-18536.
Model:	OrderedModel	AIC:	3.709e+04
Method:	Maximum Likelihood	BIC:	3.715e+04
Date:	Sun, 15 Feb 2026
Time:	22:22:14
No. Observations:	9930
Df Residuals:	9921
Df Model:	3

	coef	std err	z	P>\|z\|	[0.025	0.975]
action	-0.7091	0.041	-17.498	0.000	-0.789	-0.630
intention	-0.7205	0.037	-19.709	0.000	-0.792	-0.649
contact	-0.9614	0.050	-19.221	0.000	-1.059	-0.863
1/2	-2.8374	0.047	-60.363	0.000	-2.930	-2.745
2/3	-0.3829	0.032	-11.979	0.000	-0.446	-0.320
3/4	-0.5404	0.029	-18.541	0.000	-0.597	-0.483
4/5	0.0213	0.019	1.123	0.262	-0.016	0.058
5/6	-0.4032	0.025	-16.415	0.000	-0.451	-0.355
6/7	-0.0981	0.025	-3.921	0.000	-0.147	-0.049

Instead of relying on a table (which is almost always a bad idea, specifically in causal settings) that gives us coefficients on a scale that is not intuitive at all, let’s do what has been suggested above.

cats = res_m1.model.labels
cats

array([1, 2, 3, 4, 5, 6, 7])

new_df = pd.DataFrame({
    'action':[0, 0],
    'intention':[0, 1],
    'contact':[0, 0]
})
new_df

	action	intention	contact
0	0	0	0
1	0	1	0

So we vary the intention only, while keeping action and contact fixed. I’ll only do this comparison but others are valid as well.

df.groupby(['action', 'intention', 'contact']).size()

action  intention  contact
0       0          0          1655
                   1           993
        1          0          1986
                   1           993
1       0          0          2648
        1          0          1655
dtype: int64

res_m1.model.predict(res_m1.params, new_df, which='cumprob')

array([[0.05533774, 0.1038192 , 0.1717946 , 0.36552308, 0.52913695,
        0.73560502, 1.        ],
       [0.10746953, 0.19232512, 0.2989207 , 0.54216151, 0.69787495,
        0.85116506, 1.        ]])

preds = pd.DataFrame(res_m1.model.predict(res_m1.params, new_df, which='cumprob'))
preds

	0	1	2	3	4	5	6
0	0.055338	0.103819	0.171795	0.365523	0.529137	0.735605	1.0
1	0.107470	0.192325	0.298921	0.542162	0.697875	0.851165	1.0

plt.figure(figsize=(7, 5))

plt.plot(preds.iloc[0, :], label='A=0, I=0, C=0')
plt.plot(preds.iloc[1, :], label='A=0, I=1, C=0')

plt.legend()

Same thing can be done for each possible combination and included with subplots. I just wanted to convey the idea.

If you’re interested in Bayesian version, Bambi provides a very quick way to fit.

BAYESIAN WAY

df = pd.read_csv('Trolley_data.csv', usecols=['response', 'action', 'intention', 'contact'])
df.head()

	response	intention	contact
0	4	0	1
1	3	0	1
2	4	0	1
3	3	1	1
4	3	1	1

df['response'] = pd.Categorical(df['response'], ordered=True)

ol_m2 = bmb.Model('response ~ action + intention + contact', data=df, family='cumulative')
idata_m2 = ol_m2.fit()

/opt/anaconda3/envs/pymc_env/lib/python3.12/site-packages/bambi/formula.py:111: UserWarning: The intercept is omitted in ordinal families
  warnings.warn("The intercept is omitted in ordinal families")
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [threshold, action, intention, contact]

/opt/anaconda3/envs/pymc_env/lib/python3.12/site-packages/pytensor/compile/function/types.py:992: RuntimeWarning: invalid value encountered in accumulate
  outputs = vm() if output_subset is None else vm(output_subset=output_subset)

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 58 seconds.

az.summary(idata_m2)

	mean	sd	hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
action	-0.704	0.041	-0.779	-0.622	0.001	0.001	2613.0	2819.0	1.0
contact	-0.956	0.050	-1.054	-0.867	0.001	0.001	2731.0	3021.0	1.0
intention	-0.716	0.036	-0.784	-0.651	0.001	0.000	3179.0	2883.0	1.0
threshold[0]	-2.830	0.047	-2.915	-2.742	0.001	0.001	1812.0	2499.0	1.0
threshold[1]	-2.148	0.043	-2.222	-2.064	0.001	0.001	2060.0	2642.0	1.0
threshold[2]	-1.565	0.039	-1.640	-1.494	0.001	0.001	2104.0	2526.0	1.0
threshold[3]	-0.544	0.037	-0.615	-0.475	0.001	0.001	2308.0	2756.0	1.0
threshold[4]	0.124	0.037	0.054	0.192	0.001	0.001	2524.0	3046.0	1.0
threshold[5]	1.031	0.039	0.958	1.106	0.001	0.001	2949.0	3336.0	1.0

grid = pd.DataFrame({
    'action':    [0, 0],
    'intention': [0, 1],
    'contact':   [0, 0],
})

idata_pred = ol_m2.predict(idata_m2, data=grid, kind='response_params', inplace=False)

p = idata_pred.posterior['p']

ps = p.stack(sample=('chain', 'draw')).transpose('sample', '__obs__', 'response_dim')

# CDF
cdfs = ps.cumsum(dim='response_dim')

cdf_mean = cdfs.mean(dim='sample')
cdf_lo   = cdfs.quantile(0.055, dim='sample')
cdf_hi   = cdfs.quantile(0.945, dim='sample')

x = np.arange(cdf_mean.sizes['response_dim'])

labels = ['A=0, I=0, C=0', 'A=0, I=1, C=0']

for i, lab in enumerate(labels):
    y  = cdf_mean.isel(__obs__=i).values
    lo = cdf_lo.isel(__obs__=i).values
    hi = cdf_hi.isel(__obs__=i).values

    plt.plot(x, y, label=lab)
    plt.fill_between(x, lo, hi, alpha=0.5)

plt.legend()
plt.tight_layout()
plt.show()